Statistical Modeling

We will go through the following process: Explore data -> Fit model -> Evaluate model -> Improve model

Step 1. Explore Data

Perform exploratory analysis on the variables using the whole data set. Describe the data and comment on your observations/findings.

Key Points of the Dataset:

  1. Demographic Variables: Such as age, sex, and race/ethnicity.
  2. Health Metrics: Including glycohemoglobin (GH), weight, height, BMI, various body circumferences, skinfold measurements, and blood markers like albumin, blood urea nitrogen, and creatinine.
  3. Medical History: Information about whether the participants are on insulin or diabetes medications.

Response Variable:

The response variable in this dataset is likely to be diabetes, which is a binary variable indicating whether the participant has diabetes (often derived from the glycohemoglobin measurements, typically GH >= 6.5%).
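A minimal sketch (with made-up gh values) of how such a binary response could be derived from glycohemoglobin using the gh >= 6.5% cutoff:

```python
import pandas as pd

# Hypothetical example: derive a binary diabetes indicator from
# glycohemoglobin (gh) using the clinical cutoff gh >= 6.5%.
df = pd.DataFrame({"gh": [5.4, 6.7, 6.5, 5.9]})
df["diabetes"] = (df["gh"] >= 6.5).astype(int)
print(df["diabetes"].tolist())  # [0, 1, 1, 0]
```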

diabetes is the response variable.

On_Insulin_or_Diabetes_Meds is a binary variable.

Race_Or_Ethnicity is a categorical variable.

The remaining variables are numeric (quantitative) predictors.

Some of the predictors have missing values, such as Income_Min, Income_Max, Subscapular_Sk, etc.

Descriptive Statistics

Heatmap Visualization: The heatmap provides a visual representation of the Pearson correlation coefficients between pairs of variables in the dataset.

Color Scale: The color scale ranges from -1 (dark blue) to +1 (dark red), indicating the strength and direction of the correlation. Red signifies a strong positive correlation, blue signifies a strong negative correlation, and white indicates no correlation.

Here is a brief description of the heatmap:

  1. Diagonal Elements: All diagonal elements are 1, indicating that each feature is perfectly correlated with itself.

  2. Target Variable (diabetes):

    • gh (Glycohemoglobin) has a high positive correlation with diabetes (0.52).
    • On_Insulin_or_Diabetes_Meds also has a strong positive correlation with diabetes (0.78).
  3. High Positive Correlations:

    • Income_Min and Income_Max have a near-perfect positive correlation (0.99), suggesting they represent similar information.
    • Upper_Leg_Length_cm and Upper_Arm_Length_cm have a high positive correlation (0.76).
  4. High Negative Correlations:

    • Triceps_Skinfold_mm and Sex have a moderately strong negative correlation (-0.51).
  5. General Observations:

    • Most features have low to moderate correlations with each other, indicating that they are relatively independent.
    • There are some notable clusters of higher correlations among anthropometric measurements (e.g., BMI, Height_cm, Weight_Kg).
    • Age shows a moderate positive correlation with Blood_Urea_Nitrogen_mg/dL and a moderate negative correlation with Height_cm.
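The underlying matrix can be computed with pandas; the following sketch uses synthetic stand-ins for the dataset's columns (the variable relationships here are invented for illustration). A plotting library such as seaborn would then render it, e.g. `sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm")`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: Height_cm declines mildly with Age,
# Weight_Kg rises with Height_cm (illustrative relationships only).
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 200)
height = 170 - 0.1 * age + rng.normal(0, 6, 200)
weight = 0.5 * height + rng.normal(0, 8, 200)
df = pd.DataFrame({"Age": age, "Height_cm": height, "Weight_Kg": weight})

# Pearson correlation matrix: diagonal entries are always 1
corr = df.corr(method="pearson")
print(corr.round(2))
```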

The pairplot or scatterplot matrix is used to visualize the pairwise relationships between multiple variables in a dataset. Here’s a brief description of the plot:

  1. Distribution of Individual Variables:

    • The histograms or KDE plots along the diagonal show the distribution of each variable.
    • Some variables appear to be normally distributed, while others show skewness or bimodal distributions.
  2. Linear Relationships:

    • There are several pairs of variables that exhibit a clear linear relationship, as indicated by the elliptical shape of the scatter plots. This suggests a strong correlation between these pairs.
  3. Non-linear Relationships:

    • Some scatter plots show curved patterns, indicating non-linear relationships between variables.
  4. Clusters:

    • There are several scatter plots where data points are clustered into distinct groups, suggesting the presence of subgroups within the data.
  5. Outliers:

    • Certain scatter plots show outliers that deviate significantly from the main cluster of data points.
  6. Discrete Variables:

    • Some scatter plots exhibit a striped pattern, indicating that one or both variables are discrete or categorical.
  7. Symmetry:

    • The pair plot is symmetrical across the diagonal, meaning the scatter plot for variable A vs. B is mirrored by the plot for B vs. A.

The scatter plots in the upper and lower triangles provide a visual way to detect linear or non-linear relationships, clusters, outliers, and other patterns within the data. The pair plot is especially useful for exploring the relationships in a dataset with multiple numerical variables. These observations can help in understanding the underlying structure of the data, identifying patterns and relationships, and guiding further analysis or modeling efforts.
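A pairplot of this kind can be sketched with pandas' scatter-matrix helper (synthetic data and illustrative column names; seaborn's `pairplot` is an equivalent alternative):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
from pandas.plotting import scatter_matrix

# Synthetic data: Weight_Kg is built to track BMI so one off-diagonal
# panel shows a clear linear relationship.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.uniform(20, 80, 150),
    "BMI": rng.normal(27, 5, 150),
})
df["Weight_Kg"] = 2.8 * df["BMI"] + rng.normal(0, 5, 150)

# Diagonal panels: each variable's distribution (KDE here);
# off-diagonal panels: mirrored pairwise scatter plots.
axes = scatter_matrix(df, diagonal="kde", figsize=(6, 6))
```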

Step 2. Fit Model (default or basic)

Split the data set into a training set and a testing set in an approximate 75:25 ratio. Set the random state/seed using the last 4 digits of your SP admission number. Fit the full additive MLR model on the training set.

Build a full MLR model for the response variable (diabetes) using the predictors (e.g., Race, BMI, gh (glycohemoglobin), etc.).
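A sketch of the split-and-fit step with synthetic stand-in data (the seed 1234 is a placeholder for the last 4 digits of an admission number, and the formula lists only a few illustrative predictors):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the NHGH variables
rng = np.random.default_rng(1234)  # placeholder seed
n = 200
df = pd.DataFrame({
    "gh": rng.normal(5.8, 1.0, n),
    "BMI": rng.normal(27, 5, n),
    "Age": rng.uniform(20, 80, n),
})
df["diabetes"] = (df["gh"] >= 6.5).astype(int)

# 75:25 train/test split
train = df.sample(frac=0.75, random_state=1234)
test = df.drop(train.index)

# Full additive MLR: every predictor enters linearly, no interactions
model = smf.ols("diabetes ~ gh + BMI + Age", data=train).fit()
print(model.params)
```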

Step 3: Evaluate Model

Conduct relevant diagnostics on the full MLR model fitted. Evaluate the model from the perspectives of model fit, prediction accuracy, model/predictor significance, and checking of assumptions.

How well does the model fit the data? Is the model likely to be useful for prediction? Are any of the basic assumptions violated?

We can perform the following diagnostics:

Check goodness of fit ($R^2$) and accuracy (MSE)

R² is 61%, a moderate fit: the model explains 61% of the variability in the response. Adjusted R² is 60.9%.

Check assumptions: normality

We can use visualizations or run statistical tests to check if residuals satisfy the normality assumption.

The plot shown is a histogram with an overlaid kernel density estimate (KDE) plot. Here’s a detailed description of the plot based on the dataset:

General Description

Observations:

  • Distribution
  • Skewness
  • Density
  • Outliers

Interpretation:

  • Central tendency
  • Bimodal nature
  • Data spread

Possible Variable Context

Given the nature of the distribution, the variable could represent a standardized or normalized measure where most data points cluster around a mean value of 0. The secondary peak around 1 might indicate a different category or group within the data that exhibits distinct behavior.

Conclusion

This plot provides valuable insights into the distribution of the variable, highlighting its central tendency, skewness, and the presence of bimodal characteristics. Understanding this distribution is crucial for further analysis, such as identifying relationships with other variables or preparing the data for modeling.

To check normality of residuals:

This QQ plot shows that the errors are not normally distributed.

To perform the Jarque-Bera test on residuals, type: sm.stats.jarque_bera(residname)

The function returns: JB test statistic, p-value, estimated skewness, estimated kurtosis

To perform the Omnibus test on residuals, type: sm.stats.omni_normtest(residname)

Conduct the Shapiro-Wilk normality test. The function returns the test statistic and p-value.

There are two ways to conduct the Anderson-Darling normality test:

One uses the scipy library, and one uses the statsmodels library.

scipy: returns the test statistic and critical values. Reject H0 if the test statistic exceeds a critical value.

statsmodels: returns the test statistic and p-value.

Check Assumption: Homoscedasticity (constant variance)

We can use visualization or run statistical test to check if residuals satisfy the homoscedasticity assumption.

The residual plot shows two distinct linear trends, which is inappropriate for a single linear model.

Reject H0: the variance is non-constant, and there may even be a non-linear effect.

Check assumptions: independence

There is no autocorrelation; however, we can see two patterns in the residuals.

Perform the Durbin-Watson test on residuals. Type: sm.stats.durbin_watson(residname)

Multicollinearity

We can check for multicollinearity in the data set using the variance inflation factor (VIF):

VIF interpretation:

  • VIF = 1: not correlated
  • 1 < VIF < 5: moderately correlated
  • VIF > 5: highly correlated

No serious multicollinearity so far; continue to check.

The condition number is high (5,068,760.6), which signals strong multicollinearity or other numerical problems.

Check Multicollinearity (VIF)

To print out all VIFs by a "for" loop:

Sex, Age, Race, Up_Leg_Len, and Up_Arm_Len are not correlated.

On_Insulin_or_Diabetes_Meds, Triceps_Sk, Subscapular_Sk, Albumin, Blood_Urea, Creatinine, and gh (glycohemoglobin) are moderately correlated.

Income_Min, Income_Max, Weight_Kg, Height_cm, BMI, Arm_Cir, and Waist_Cir are highly correlated.

Step 4: Improve Model

Improve the model using at least 4 of the following techniques where appropriate (the techniques applied are numbered below).

Explain how the model is improved after applying each of the techniques.

1. Removing Outlier(s) (if any) from the residuals

The 2nd model shows no difference from the previous model; the MSE is about the same.

2. Perform backward selection based on p-values: remove predictors with p-values higher than 0.05.

The MSE of the models decreases slightly as predictors with p-values above 0.05 are removed.

Adjusted R² has slightly improved (62%).

The multicollinearity problem has improved: individual predictors are now only moderately correlated (1 < VIF < 5).

3. Create interaction of variables

Adjusted R² has improved (62.4%).

The MSE of this model is lower (0.0432).

The condition number is 4.16e+03.
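With statsmodels formulas, interactions are added directly in the formula string; a sketch with illustrative variable names (`a:b` adds only the product term, `a*b` adds `a`, `b`, and `a:b`):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data where the response depends on a BMI-by-Age product
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"BMI": rng.normal(27, 5, n), "Age": rng.uniform(20, 80, n)})
df["y"] = 0.01 * df["BMI"] * df["Age"] + rng.normal(size=n)

# "BMI * Age" expands to main effects plus the BMI:Age interaction
fit = smf.ols("y ~ BMI * Age", data=df).fit()
print(fit.params.index.tolist())
```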

4. Transform variables (5th Model)

Even after transforming variables, the residuals are still not normal and the variance is not stabilized.

Using multilinear regression for a binary response variable is not appropriate because multilinear regression assumes a continuous response variable. For binary response data, logistic regression is typically used instead. Here’s why:

Multilinear Regression

Multilinear regression assumes a continuous response variable, normally distributed errors, and a linear relationship between the predictors and the response. When the response variable is binary (e.g., 0 or 1), these assumptions are violated. Specifically:

- The response variable is not continuous.
- The error terms do not follow a normal distribution.
- The relationship between predictors and the binary response is not necessarily linear.
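A sketch of the logistic alternative with statsmodels (synthetic gh values and an assumed logistic relationship; fitted values are probabilities, so they always lie in (0, 1), unlike MLR fitted values):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: probability of diabetes rises with gh around 6.5
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"gh": rng.normal(5.8, 1.0, n)})
p_true = 1.0 / (1.0 + np.exp(-(df["gh"] - 6.5) * 3.0))
df["diabetes"] = (rng.uniform(size=n) < p_true).astype(int)

# Logistic regression models the log-odds of diabetes as linear in gh
fit = smf.logit("diabetes ~ gh", data=df).fit(disp=0)
prob = np.asarray(fit.predict(pd.DataFrame({"gh": [5.0, 7.5]})))
print(prob.round(3))
```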

Logistic Regression

Logistic regression models the log-odds of the outcome as a linear function of the predictors, so the fitted probabilities always lie between 0 and 1.

Conclusion

This showcase uses statistical functions to demonstrate visualization with the NHGH dataset in a Python notebook. For a binary response variable, logistic regression is the appropriate choice: it accounts for the binary nature of the response variable and models the probability of the occurrence of an event.